# Connection to MGnify API
from jsonapi_client import Session as APISession
from jsonapi_client import Modifier
import requests
# Dataframes and display
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# Data transformation
from functools import reduce
# Plots
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
%matplotlib inline
# Create signature of MAGs for comparison against database
import sourmash
import glob
import time
from pathlib import PurePath as pp
from Bio import SeqIO
# Warning verbosity
import warnings
="ignore") warnings.filterwarnings(action
Search MGnify Genomes
You can run and edit these examples interactively on Galaxy
How to use the “genome search” for MGnify MAGs
This notebook aims to give a non-exhaustive overview of how to use the genome search API for MGnify MAGs.
This notebook is divided into 5 sections:
- 1: Libraries needed to run the full notebook
- 2: How to query the genomes database from the MGnify API, save and reload the results
- 3: Explore the whole genome dataset
- 4: Graphical representations
- 5: Case study: Find out whether your own MAGs are novel compared to the MGnify catalogues
This is an interactive code notebook (a Jupyter Notebook). To run this code, click into each cell and press the ▶ button in the top toolbar, or press shift+enter.
Import libraries
The Python libraries required to run the examples presented in this notebook are imported here.
Query the genomes database from the MGnify API
For the genome dataset, use the genomes endpoint.
A complete list of endpoints can be found at https://www.ebi.ac.uk/metagenomics/api/v1/.
endpoint_name = 'genomes'
The Modifier function from the jsonapi_client module allows us to query for specific values in given fields (e.g. 'geographic-origin', 'taxon-lineage').
One way to explore the available fields is to use requests to fetch the API endpoint:
r = requests.get(f"https://www.ebi.ac.uk/metagenomics/api/v1/{endpoint_name}")
r.json()['data'][0]
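For example, to list just the attribute field names of the first record, a minimal sketch reusing the r response fetched above:
# List the attribute keys of the first genome record, reusing the response `r` from above
sorted(r.json()['data'][0]['attributes'].keys())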
Get information for a specific genus or species
Examples: search for the available resources for a specific genus or species of interest.
- Listeria - Listeria monocytogenes
The taxon-lineage field contains domain, phylum, class, order, family, genus, species, and subspecies, e.g. d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Listeriaceae;g__Listeria;s__Listeria monocytogenes for Listeria monocytogenes.
The filter can use the full lineage d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Listeriaceae;g__Listeria, or only part of it: g__Listeria or Listeria.
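As an illustration, either of the following filter strings should select the same genus (a sketch only; the variable names are illustrative, and the actual query is built in the cells below):
# Illustrative filters: a full-lineage string or a genus-level fragment
full_lineage_filter = Modifier("taxon_lineage=d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Listeriaceae;g__Listeria")
genus_only_filter = Modifier("taxon_lineage=g__Listeria")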
Set the desired filter(s)
genus_filter = 'Listeria'
species_filter = 'Listeria monocytogenes'
Query the database with the ‘Listeria’ filter and store the results in a Pandas DataFrame.
with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    search_filter = Modifier(f"taxon_lineage={genus_filter}")
    resources = map(lambda r: r.json, mgnify.iterate(endpoint_name, filter=search_filter))
    resources_df = pd.json_normalize(resources)
# Display the table containing the results of the query
resources_df
Output: Four entries are obtained for the Listeria genus.
Query the database with the ‘Listeria monocytogenes’ filter and store the results in a Pandas DataFrame.
with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    search_filter_2 = Modifier(f"taxon_lineage={species_filter}")
    resources_2 = map(lambda r: r.json, mgnify.iterate(endpoint_name, filter=search_filter_2))
    resources_df_2 = pd.json_normalize(resources_2)
# Display the table containing the results of the query
resources_df_2
Output: Two entries are obtained for the species Listeria monocytogenes.
Save the results as parquet.
Pandas allows you to save your query results in multiple formats (see the Pandas documentation for more options).
parquet files can be opened with different libraries (including pandas, pyspark, polars, …) and have the advantage of being a compressed data format with column types preserved.
resources_df.to_parquet('Listeria_resources.parquet')
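For comparison, the same DataFrame could also be written to other formats supported by pandas (a sketch; the file names are arbitrary examples):
# Alternative output formats; unlike parquet, CSV does not preserve column types
resources_df.to_csv('Listeria_resources.csv', index=False)
resources_df.to_json('Listeria_resources.json', orient='records')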
Load previously saved parquet files
listeria_df = pd.read_parquet('Listeria_resources.parquet')
listeria_df
Output: Display the saved data as a pandas DataFrame.
Explore the whole genomes dataset
Query and save the dataset as parquet file
To query the whole dataset, we can use the same method as previously. The only difference is that no filter is passed to the query.
Warning: Querying without a filter is computationally expensive and will take time.
A pre-fetched copy of the data (as of 8 November 2022) is available in ../example-data/genomes/all_genome_resources.parquet
.
This was fetched using the following code, which you could copy-and-paste into a code cell to re-fetch the latest data:
with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    resources_all = map(lambda r: r.json, mgnify.iterate(endpoint_name))
    resources_all_df = pd.json_normalize(resources_all)

resources_all_df
resources_all_df.to_parquet('latest_genome_resources.parquet')
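If you would rather skip the long fetch, the pre-fetched copy mentioned above can be loaded directly with pandas (a minimal sketch; the variable name is illustrative):
# Load the pre-fetched example dataset instead of re-fetching from the API
all_genomes_pandas_df = pd.read_parquet('../example-data/genomes/all_genome_resources.parquet')
all_genomes_pandas_df.shape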
Load the dataset file with another python library
Here is an example to show that the saved parquet files can be loaded with different python libraries (some being more suitable for loading and transforming large amounts of data). Of course, this file can also be loaded with pandas (see section 1.3.3 of this notebook).
For this example, we use pyspark. For help, see the documentation: https://spark.apache.org/docs/latest/api/python/.
PySpark requires you to start a SparkSession before use. A basic example is shown in the section below.
Start Spark Session
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
Load the data from the parquet file
all_genomes_df = spark.read.parquet('../example-data/genomes/all_genome_resources.parquet')
Get the shape of the DataFrame
# To get the number of rows
all_genomes_df.count()
# To get the number of columns
len(all_genomes_df.columns)
Outputs: The dataframe has 9421 rows and 37 columns.
Get stats when applicable
PySpark can show statistics of the data in the parquet dataframe:
all_genomes_df.describe().show(truncate=False, vertical=True)
Outputs: For example:
- the first record gives the number of non-null values for each column.
- gc-content on the whole dataset is 47.951392633478235 ± 10.364966184467969 % with a min value of 22.94% and a max value of 74.84% (if you are using the example dataset)
Note: When using Spark, column names containing '.' (e.g. column.name) need to be escaped with backticks, as in the cell below.
# To visualise only gc-content:
all_genomes_df.select('`attributes.gc-content`').describe().show()
Get the most represented genus in the dataset
# To see a sample of taxon-lineages present in the dataset:
all_genomes_df.select('`attributes.taxon-lineage`').show(truncate=False)
# The total number of genomes in the dataset:
all_genomes_df.select('`id`').distinct().count()
# The number of distinct lineages:
all_genomes_df.select('`attributes.taxon-lineage`').distinct().count()
Split the taxon-lineage column into 7 columns
features = ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']

all_genomes_tax_df = reduce(
    lambda df, i: df.withColumn(features[i], F.col('lineage_split')[i]),
    range(len(features)),
    all_genomes_df.withColumn('lineage_split', F.split(F.col('`attributes.taxon-lineage`'), ';')),
)

all_genomes_tax_df.select(features).show(n=5)
Query examples on taxon-lineage column
To search the most represented taxon:
all_genomes_tax_df.groupby('`attributes.taxon-lineage`').count().filter(F.col('count') > 100).show(truncate=False)
Output: The Collinsella genus with no species annotated is the most represented taxon in the dataset (if you are using the example dataset).
To search for a particular lineage and count how many times it appears:
all_genomes_tax_df.filter(F.col('`attributes.taxon-lineage`').startswith('d__Bacteria;p__Actinobacteriota;c__Coriobacteriia;o__Coriobacteriales;f__Coriobacteriaceae;g__Collinsella')).count()
To search for a particular genus and count how many times it appears:
all_genomes_tax_df.filter(F.col('`attributes.taxon-lineage`').contains('Collinsella')).count()
Query examples on newly created columns:
To search for the most or least represented genus, species, … in this dataset for example. The search is more flexible than for the full taxon.
all_genomes_tax_df.groupby('genus').count().filter(F.col('count') > 100).show()
Output: As previously, the genus Collinsella is the most represented with 584 entries (if you are using the example dataset)
By splitting the taxon into several columns, we can see that the genera Prevotella and RC9 are also among the most represented in this dataset, and that 325 of the taxa in the database have no genus assigned. These taxa did not show up in the earlier query on the full taxon-lineage column because their genomes are spread across proportionally more distinct species per genus than for the Collinsella genus.
all_genomes_tax_df.filter(F.col('genus').isin('g__Prevotella', 'g__RC9', 'g__Collinsella')).groupby('genus').agg(F.countDistinct('species')).show()
# To see some of the Collinsella species in the dataset:
all_genomes_tax_df.filter(F.col('genus') == 'g__Collinsella').select('species').distinct().show(truncate=False)
Graphical representation
This section contains examples of graphical representations that could be obtained from this dataset.
all_genomes_tax_df.count()
Representation of the bacterial taxon-lineage
This graphic gives an idea of the proportional representation of each taxon in the dataset.
List of the features used for the graphical representation (see section 1.4.4.1, where the features variable is defined):
features
all_genomes_tax_df.select([F.count_distinct(x).alias(f'{x}_count') for x in features]).show()
Output: Count of the distinct values for each category.
Create a Sankey diagram to visually portray this. Note: Colours and node sizes can be changed in the code below.
def get_sankey(df, cat_cols=[], value_cols='', title='Sankey Diagram'):
    # Colors
    colorPalette = ['rgba(31, 119, 180, 0.8)',
                    'rgba(255, 127, 14, 0.8)',
                    'rgba(44, 160, 44, 0.8)',
                    'rgba(214, 39, 40, 0.8)',
                    'rgba(148, 103, 189, 0.8)',
                    'rgba(140, 86, 75, 0.8)',
                    'rgba(227, 119, 194, 0.8)',
                    'rgba(127, 127, 127, 0.8)']
    labelList = []
    colorNumList = []
    for catCol in cat_cols:
        labelListTemp = list(set(df[catCol].values))
        colorNumList.append(len(labelListTemp))
        labelList = labelList + labelListTemp

    # remove duplicates from labelList
    labelList = list(dict.fromkeys(labelList))

    # define colors based on number of levels
    colorList = []
    for idx, colorNum in enumerate(colorNumList):
        colorList = colorList + [colorPalette[idx]] * colorNum

    # transform df into a source-target pair
    for i in range(len(cat_cols) - 1):
        if i == 0:
            sourceTargetDf = df[[cat_cols[i], cat_cols[i+1], value_cols]]
            sourceTargetDf.columns = ['source', 'target', 'count']
        else:
            tempDf = df[[cat_cols[i], cat_cols[i+1], value_cols]]
            tempDf.columns = ['source', 'target', 'count']
            sourceTargetDf = pd.concat([sourceTargetDf, tempDf])
        sourceTargetDf = sourceTargetDf.groupby(['source', 'target']).agg({'count': 'sum'}).reset_index()

    # add index for source-target pair
    sourceTargetDf['sourceID'] = sourceTargetDf['source'].apply(lambda x: labelList.index(x))
    sourceTargetDf['targetID'] = sourceTargetDf['target'].apply(lambda x: labelList.index(x))

    # creating data for the sankey diagram
    data = dict(
        type='sankey',
        node=dict(
            pad=15,
            thickness=20,
            line=dict(
                color="black",
                width=0.5
            ),
            label=labelList,
            color=colorList
        ),
        link=dict(
            source=sourceTargetDf['sourceID'],
            target=sourceTargetDf['targetID'],
            value=sourceTargetDf['count']
        )
    )

    # override gray link colors with 'source' colors
    opacity = 0.4
    # change 'magenta' to its 'rgba' value to add opacity
    data['node']['color'] = ['rgba(255,0,255, 0.8)' if color == "magenta" else color
                             for color in data['node']['color']]
    data['link']['color'] = [data['node']['color'][src].replace("0.8", str(opacity))
                             for src in data['link']['source']]

    fig = go.Figure(data=[go.Sankey(
        # Define nodes
        node=dict(
            pad=15,
            thickness=15,
            line=dict(color="black", width=0.5),
            label=data['node']['label'],
            color=data['node']['color']
        ),
        # Add links
        link=dict(
            source=data['link']['source'],
            target=data['link']['target'],
            value=data['link']['value'],
            color=data['link']['color']
        )
    )])

    fig.update_layout(title_text=title, font_size=10)
    return fig.show(renderer='iframe')
Taxon-lineage represented in Genomes dataset
# Convert Spark DataFrame to Pandas DataFrame:
pdf = all_genomes_tax_df.select(features).groupby(features).count().toPandas()
# Title and number of displayed features can be modified
get_sankey(pdf, cat_cols=features[0:4], value_cols='count', title='Taxon-lineage represented in Genomes dataset')
Representation of a sample of the genomes present in the dataset: an example from the Lactobacillales order.
pdf_lactobacillales = all_genomes_tax_df.filter(F.col('order') == 'o__Lactobacillales').select(features).groupby(features).count().toPandas()
fig_l = get_sankey(pdf_lactobacillales, cat_cols=features[0:6], value_cols='count', title='Genomes from the Lactobacillales order')
# Note that there are too many distinct species in the Lactobacillales order to show individually:
all_genomes_tax_df.filter(F.col('order') == 'o__Lactobacillales').select('species').distinct().count()
Information such as genome length or GC-content can also be represented
We can group and visualise these at different levels like family, genus, species… depending on the number of sequences available and on the biological significance.
lactobacillales_df = all_genomes_tax_df.filter(F.col('order') == 'o__Lactobacillales').orderBy('family').toPandas()
lactobacillales_count = all_genomes_tax_df.filter(F.col('order') == 'o__Lactobacillales').groupby('family').count().orderBy('family').toPandas()

lactobacillales_count
fig = plt.figure(figsize=(10, 10), layout="constrained")
spec = fig.add_gridspec(3, 1)

ax00 = fig.add_subplot(spec[0, 0])
sns.barplot(data=lactobacillales_count, x='family', y='count')
plt.ylabel("Number of genomes available")

ax10 = fig.add_subplot(spec[1, 0])
sns.boxplot(data=lactobacillales_df, x='family', y='attributes.length')
plt.ylabel("Genome length (bp)")
#plt.xlabel("Family of the Lactobacillales order")

ax20 = fig.add_subplot(spec[2, 0])
sns.boxplot(data=lactobacillales_df, x='family', y='attributes.gc-content')
plt.ylabel("GC-content (%)")
plt.xlabel("Family of the Lactobacillales order")

fig.suptitle('Number of genomes available, genome length and GC-content of bacteria belonging to the Lactobacillales order');
fig = plt.figure(figsize=(20, 5))
spec = fig.add_gridspec(1, 2)

#ax00 = fig.add_subplot(spec[0, 0])
#lactobacillales_df['relationships.biome.data.id'].hist()
#plt.xlabel("Biome")

ax01 = fig.add_subplot(spec[0:])
lactobacillales_df['relationships.catalogue.data.id'].hist()
plt.xlabel("Catalogue")
ax01.grid(False)

fig.suptitle('Biome and Catalogue related to bacteria belonging to the Lactobacillales order');
Another example: produce a quality control figure similar to Extended Data Fig. 4a of Almeida et al. 2020.
qc_df = all_genomes_tax_df.toPandas()
Statistical information can be obtained with the describe function… (see section 1.4.3)
qc_df[['attributes.completeness', 'attributes.contamination']].describe()
… and visualised in plots:
fig = plt.figure(figsize=(5, 10), layout="constrained")
spec = fig.add_gridspec(1, 1)

ax00 = fig.add_subplot(spec[0, 0])
sns.boxplot(data=qc_df[['attributes.completeness', 'attributes.contamination']])
plt.ylabel("%")

fig.suptitle('Quality of genomes available');
Find out whether your own MAGs are novel compared to the MGnify catalogues
Another use for the MGnify genomes resource is to query your own MAGs against MGnify’s MAG catalogues, to see whether they are novel or already represented.
List the files to be analysed:
Replace the string with your own path to the folder containing your files. The * wildcard matches all files with the .fa extension.
files = glob.glob('../example-data/genomes/*.fa')
files
Compute a sourmash sketch for each MAG
Create “sketches” for each MAG using Sourmash
A sketch goes into a signature, which we will use for searching. The signature is a collection of hashes that are well suited to calculating the containment of your MAGs within the catalogue’s MAGs.
for mag in files:
    # The sourmash parameters are chosen to match those used within MGnify
    sketch = sourmash.MinHash(n=0, ksize=31, scaled=1000)

    # A fasta file may have multiple records in it. Add them all to the sourmash signature.
    for index, record in enumerate(SeqIO.parse(mag, 'fasta')):
        sketch.add_sequence(str(record.seq))

    # Save the sourmash sketch as a "signature" file
    signature = sourmash.SourmashSignature(sketch, name=record.name)
    with open(pp(pp(mag).name).stem + '.sig', 'wt') as fp:
        sourmash.save_signatures([signature], fp)
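As an optional sanity check, the saved signature files can be reloaded and inspected (a sketch; it assumes a recent sourmash version that provides the load_file_as_signatures helper):
# Reload each saved signature and report its name and number of hashes
# (load_file_as_signatures is assumed available, as in sourmash >= 3.5)
for sig_file in glob.glob('*.sig'):
    for sig in sourmash.load_file_as_signatures(sig_file):
        print(sig_file, sig.name, len(sig.minhash))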
Fetch all of the catalogue IDs currently available on MGnify
To fetch the catalogue IDs from the MGnify API, use the following endpoint: https://www.ebi.ac.uk/metagenomics/api/v1/genome-catalogues.
= "genome-catalogues" catalogue_endpoint
with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    catalogues = map(lambda r: r.json, mgnify.iterate(catalogue_endpoint))
    catalogues = pd.json_normalize(catalogues)
catalogue_ids = list(catalogues['id'])
catalogue_ids
Submit a search job to the MGnify API
To submit a job to the MGnify API, use the following endpoint: https://www.ebi.ac.uk/metagenomics/api/v1/genomes-search/gather.
Data will be sent to the API, which is called “POST”ing data in the API world.
This part of the API is quite specialized and so is not a formal JSON:API; the requests Python package is therefore used to communicate with it.
endpoint = 'https://www.ebi.ac.uk/metagenomics/api/v1/genomes-search/gather'
# Create a list of file uploads, and attach them to the API request
signatures = [open(sig, 'rb') for sig in glob.glob('*.sig')]
sketch_uploads = [('file_uploaded', signature) for signature in signatures]

# Send the API request - it specifies which catalogues to search against and attaches all of the signature files.
submitted_job = requests.post(endpoint, data={'mag_catalogues': catalogue_ids}, files=sketch_uploads).json()

# Tidy up the open file pointers (an explicit loop is used because a bare map() is lazy and would never run)
for signature in signatures:
    signature.close()

print(submitted_job)
Wait for the results to be ready
As you can see in the printed submitted_job above, a status_URL was returned in the response from submitting the job via the API. Since the job is in a queue, this status_URL must be polled until our job is completed.
Below is an example that checks every 2 seconds until ALL of the jobs are finished. The wait time can easily be changed by passing a different value to time.sleep (10 s is used for the initial wait below):
time.sleep(10)
job_done = False
while not job_done:
    print('Checking status...')

    # The status_URL is another API endpoint that's unique for the submitted search job
    query_result = None
    while not query_result:
        query_result = requests.get(submitted_job['data']['status_URL'])
        print('Still waiting for jobs to complete. Current status of jobs')
        print('Will check again in 2 seconds')
        time.sleep(2)

    queries_status = {sig['job_id']: sig['status'] for sig in query_result.json()['data']['signatures']}
    job_done = all(map(lambda q: q == 'SUCCESS', queries_status.values()))

print('Job done!')
The query_result contains the results of the query.
The results can be visualised as JSON (try entering query_result.json()), or as a Pandas dataframe:
query_result_df = pd.json_normalize(query_result.json()['data']['signatures'])
query_result_df
Output: Each signature is queried against each catalogue entry.
Results can then be analysed according to your research goal, for example looking for representation in Host-Associated biome catalogues vs. Environmental ones.
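For example, here is a quick way to tabulate the number of positive matches per catalogue (a sketch; it relies on the catalogue and result.match columns of query_result_df shown above):
# Count non-null matches per catalogue; groupby-count ignores NaN (unmatched) rows
query_result_df.groupby('catalogue')['result.match'].count()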
Are any of our MAGs found in biomes other than the human gut?
matches = query_result_df.dropna(subset=['result.match'])
matches
Output: DataFrame containing only the results with a positive match in the catalogue.
fig = plt.figure(figsize=(3, 3))
spec = fig.add_gridspec(1, 2)

ax01 = fig.add_subplot(spec[0:])
matches.catalogue.hist()
plt.xlabel("Catalogue")
ax01.grid(False)

fig.suptitle('Biomes matched by the MAGs');
Output: Displays the biomes matched by the MAGs during the search. The five signatures match various numbers of genomes in the human gut, cow rumen, and pig gut catalogues.
What is the taxonomy of the MGnify MAGs which match the query?
Call the API for each genome to find its taxonomic lineage.
def get_taxonomy_of_mgnify_mag(match_row):
    mgyg_accession = match_row['result.match']
    with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
        genome_document = mgnify.get('genomes', mgyg_accession)
    return genome_document.resource.taxon_lineage
Create a new column in the matches table.
matches['best_match_taxonomy'] = matches.apply(get_taxonomy_of_mgnify_mag, axis=1)
matches
Output: The matches DataFrame has a new column that contains the corresponding taxon-lineage information.
Summarise the results more verbosely:
for row, match in matches.iterrows():
    print(f"The MAG ({match['filename']}) matches {match['result.match']} which has taxonomy {match['best_match_taxonomy']}")
Which of our MAGs are completely novel (i.e. in no MGnify catalogue)?
One way to check this is to group all of the search results by filename (i.e. finding the queries for each MAG vs. all catalogues) and check whether the sum of all matches is 0…
query_result_df.groupby('filename').apply(lambda query: query['result.matches'].sum() == 0)
If you’re using the example dataset for this example, none of the MAGs are completely novel.
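To extract just the file names of any novel MAGs, the boolean Series above can be filtered (a sketch; novelty is an illustrative variable name):
# Keep only the filenames whose MAGs had zero matches across all catalogues
novelty = query_result_df.groupby('filename').apply(lambda query: query['result.matches'].sum() == 0)
list(novelty[novelty].index)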